Data needs structure!

When we start using digital tools, the first thing we have to find out is what kind of data they expect and how it needs to be structured. Does the tool want a plain old text? Several plain old texts? Some extra information besides?

Many tools, including the one we'll look at today, use the same sorts of variables, lists, and dictionaries that we've been learning about in Python. When they speak to each other they often express these variables, lists, and dictionaries in a format called JSON.


In [1]:
import json

JSON stands for "JavaScript Object Notation". It turned out to be so simple and useful that it is now used for many, many things that have nothing to do with JavaScript.

The idea is that everything is an object. There are simple objects, and there are complex objects, but they are all objects. And pretty much anything can be represented this way!

Simple objects

Simple objects are things like numbers, letters, words, entire sequences of words, and Boolean values i.e. true or false. JSON represents these...exactly as they are.


In [2]:
print("A number", json.dumps( 1.234 ))
print("A string", json.dumps( 'Message in %d bottles' % 1000 ))
print("A Boolean", json.dumps( 2 + 2 == 4 ))


A number 1.234
A string "Message in 1000 bottles"
A Boolean true

It's worth noticing what it did with that bottled message though - it was printed with double quotes around it. This is part of the JSON specification - strings, which is to say simple objects that are neither numbers nor Booleans, are wrapped in double quotes. Compare the JSON-dumped and plain version...


In [3]:
print("My normal string -", 'Message in %d bottles' % 1000)
print("My JSON string -", json.dumps( 'Message in %d bottles' % 1000 ))


My normal string - Message in 1000 bottles
My JSON string - "Message in 1000 bottles"

So what happens if your message itself has a quotation?


In [4]:
print("A string -", json.dumps( '"Programming is great!" exclaimed Alice.' ))


A string - "\"Programming is great!\" exclaimed Alice."

Double-quotes within a double-quoted string are managed by putting the superpower sign \ (backslash) in front of them. And if your string has a backslash? Put a backslash in front of it.


In [5]:
# In Python source code, too, a literal backslash is written as \\
backslash_string = '"What would we do without the \\ character?", Susan mused.'
print(json.dumps( backslash_string ))


"\"What would we do without the \\ character?\", Susan mused."

It's not the prettiest thing ever, but it works! And then if you have some JSON, you can read it again with the .loads() function.


In [6]:
json_string = json.dumps( backslash_string )
print(json.loads( json_string ))


"What would we do without the \ character?", Susan mused.

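The same round trip works for the other simple objects. Here is a minimal sketch (the variable names are only for illustration) showing that json.loads() turns each piece of JSON text back into the matching Python value:

```python
import json

# Each piece of JSON is just a Python string until we load it.
number = json.loads('1.234')                     # JSON number  -> Python float
flag = json.loads('true')                        # JSON Boolean -> Python bool
text = json.loads('"Message in 1000 bottles"')   # JSON string  -> Python str

print(type(number).__name__, number)
print(type(flag).__name__, flag)
print(type(text).__name__, text)
```

Note that the JSON string loses its surrounding double quotes on the way back in; they belong to the JSON notation, not to the message.
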
Complex objects

Okay, but we have an entire text and all sorts of information we want to encode, and if we were happy to just throw it in between double quotes, we wouldn't be here. We need complex objects. They include:

  • A list of objects
  • A dictionary of objects

For starters, we might want to indicate that our text is not just a long line, but individual words. We can make it a list.


In [7]:
story = "It was a dark  and stormy night."
## We want to get each word. We could do it the hard way...
storywords = [ 'It', 'was', 1, 'dark', 'and', 'stormy', 'night.' ]
print("Try #1:", json.dumps( storywords ))


Try #1: ["It", "was", 1, "dark", "and", "stormy", "night."]

But why do we learn Python if not to make things easy for ourselves? Let's make the same list, the easy way - splitting up the story according to the spaces.


In [8]:
storywords = story.split()
storywords[2] = 1   # replace 'a' with the number 1: a list can mix strings and numbers
print(storywords)
print("Try #2:", json.dumps( storywords ))


['It', 'was', 1, 'dark', 'and', 'stormy', 'night.']
Try #2: ["It", "was", 1, "dark", "and", "stormy", "night."]

Notice the JSON rules for a list:

  • It is surrounded by [] (square brackets).
  • Its elements (which can be any object, simple or complex) are separated by commas.
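To illustrate the second rule, here is a small sketch of a list whose elements are a mix of simple objects and other lists. The list contents and variable names are made up for the example.

```python
import json

# A list may contain simple objects and other lists, mixed freely.
nested = ['dark', 1, True, ['and', 'stormy', ['night.']]]
json_text = json.dumps(nested)
print(json_text)

# Loading the text gives back an equal Python list,
# and we can reach inside the inner lists by index.
restored = json.loads(json_text)
print(restored[3][2][0])
```

The inner lists are written with their own square brackets, nested inside the outer pair.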

And so now we can send a list to someone, and they can send a list to us. We convert JSON back into data with the json.loads() function. This is important - JSON itself is just a character string, and doesn't become a list until we load it!


In [9]:
json_string_from_elsewhere = '["So", "long", "and", "thanks", "for", "all", "the", "fish."]'
python_string = 'Hello'
print("Second word is", json_string_from_elsewhere[1])


Second word is "

See? You might have expected the second element in the list, but instead you got the second character in the string. Let's try that again after loading the string.


In [10]:
wordlist = json.loads( json_string_from_elsewhere )
print("Second word is", wordlist[1])


Second word is long

But we might want to describe something even more complex - like the fact that each thing in the list is a "word", and maybe even the word number. Sure, we can figure that out by looking at the list and using our common sense, but computers don't have common sense, and maybe later we will want to do something with the words that involves mixing up their order.

So for each word, let's make a little dictionary to say that the "word" is whatever the word is, and the "sequence" shows the order of the words. If we felt like it we could add more information like the word's root form, or whether it had punctuation before or after, or anything we like.


In [11]:
storywords = []
counter = 1
for w in story.split():
    wordinfo = { 'word': w, 'sequence': counter }
    storywords.append( wordinfo )
    counter += 1
print(storywords)
print(json.dumps( storywords ))


[{'sequence': 1, 'word': 'It'}, {'sequence': 2, 'word': 'was'}, {'sequence': 3, 'word': 'a'}, {'sequence': 4, 'word': 'dark'}, {'sequence': 5, 'word': 'and'}, {'sequence': 6, 'word': 'stormy'}, {'sequence': 7, 'word': 'night.'}]
[{"sequence": 1, "word": "It"}, {"sequence": 2, "word": "was"}, {"sequence": 3, "word": "a"}, {"sequence": 4, "word": "dark"}, {"sequence": 5, "word": "and"}, {"sequence": 6, "word": "stormy"}, {"sequence": 7, "word": "night."}]

Okay! The rules for dictionary objects are slightly more complex:

  • Each dictionary is surrounded by {} (curly braces).
  • The dictionary has a set of keys and a set of values; each key is joined to its value by a colon, and the key-value pairs are separated by commas.
  • Each key must be a string (i.e. a simple JSON object that goes into double quotes).
  • Each value can be any object, simple or complex.
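The rule that every key must be a string has a visible consequence in Python, where a dictionary key can also be a number. A minimal sketch, with a made-up dictionary:

```python
import json

# Python allows a number as a dictionary key; JSON does not,
# so json.dumps() writes each key as a string.
page_counts = {1: 'one page', 2: 'two pages'}
json_text = json.dumps(page_counts)
print(json_text)

# After loading, the keys are strings, not numbers.
restored = json.loads(json_text)
print(list(restored.keys()))
```

So a dictionary with number keys does not come back from JSON exactly as it went in. If that matters, use string keys from the start.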

So what we have here is a list of dictionaries - each dictionary has a 'word' and a 'sequence'.
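Here is a sketch of why the 'sequence' number is worth recording. It rebuilds the same list of dictionaries in a more compact way (enumerate() does the counting for us), mixes up the order by sorting alphabetically, and then uses 'sequence' to put the story back together. The sorting steps are an illustration added here, not part of the collation tool.

```python
story = "It was a dark and stormy night."
storywords = [{'word': w, 'sequence': n}
              for n, w in enumerate(story.split(), start=1)]

# Mix up the order: sort alphabetically, ignoring upper and lower case.
alphabetical = sorted(storywords, key=lambda info: info['word'].lower())
print([info['word'] for info in alphabetical])

# The 'sequence' values let us restore the original order.
restored = sorted(alphabetical, key=lambda info: info['sequence'])
print(' '.join(info['word'] for info in restored))
```

With a plain list of words, the original order would have been lost after the first sort.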

By now you'll also have noticed that these JSON concepts, and the way JSON writes them out, are pretty similar to what you've learned about how to make strings and lists and dictionaries in Python! This is no accident. What it means is that you can put your data into Python, and then use json.dumps() to serialize it (that is, get it in a form that can be sent to someone else) and then, when you get data back from them, you can use json.loads() to read it again.

But then, you ask, why not just use the plain old Python objects and send those around? Answer: because not everyone uses Python to program - in a minute we are going to talk to a server that is written in Java! Second answer: because one of the rules of programming is that the objects you make within your program cannot be directly sent outside your program. Your program's objects are the direct thoughts and pictures in its brain, and other programs can't simply read your program's mind - they have to communicate by speaking words or drawing pictures that can be passed around.
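One way to see that a program's own objects don't automatically travel is to try serializing something JSON has no notation for. This sketch uses a Python set as the example; the variable names are invented for illustration.

```python
import json

# A set is a perfectly good Python object, but JSON has no way to write one.
unique_words = {'dark', 'stormy'}
try:
    json.dumps(unique_words)
    outcome = 'serialized'
except TypeError as problem:
    outcome = 'refused'
    print("json.dumps() refused:", problem)

# Converting the set to a (sorted) list first gives JSON something it can express.
json_text = json.dumps(sorted(unique_words))
print(json_text)
```

The receiving program gets a list, not a set. It is up to the two sides to agree on what the data means.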

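Why dumps and loads? The s on the end stands for "string": json.dumps() dumps your data into a string, and json.loads() loads data from a string. The words themselves are a common metaphor. I carry around a bunch of information and then dump it in someone else's lap; that person loads it into his or her wheelbarrow (program) and carries it somewhere else.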
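The json module also has dump() and load(), without the final s, which work with files rather than strings. A minimal sketch, using io.StringIO as a stand-in for a real file so that nothing is written to disk:

```python
import io
import json

wordlist = ['So', 'long', 'and', 'thanks']

# json.dump() (no "s") writes the JSON into a file-like object.
pretend_file = io.StringIO()
json.dump(wordlist, pretend_file)

# Rewind to the start, then json.load() (no "s") reads it back.
pretend_file.seek(0)
restored = json.load(pretend_file)
print(restored)
```

With a real file you would pass the object returned by open() in place of pretend_file.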

JSON for Text Collation

We have a collation program that will take a set of variants of a text and align them, giving us a result that shows where the texts differ and where they are the same. This is a very important thing to do if you are working on philology, or a critical edition of a text.

The collation program accepts JSON input that tells it what the witnesses are (that is, the different text variants) and returns an answer with the witnesses all aligned. We can try it out! There are two ways to do this, and the first is pretty straightforward, so that's what we will use. It wants a dictionary with the key "witnesses", whose value is a list of dictionaries. Each dictionary (that is, each witness) has an "id" and some "content".

{ "witnesses": [ { "id": "A", "content": "This is the first text" }, { "id": "B", "content": "This is the second text" } ] }

Let's build that up in Python!


In [12]:
first_witness  = "Du kennst von Alters her meine Art, mich anzubauen, irgend mir an einem vertraulichen Orte ein Hüttchen aufzuschlagen, und da mit aller Einschränkung zu herbergen."
second_witness = "Du kennst von Altersher meine Art, mich anzubauen, mir irgend an einem vertraulichen Ort ein Hüttchen aufzuschlagen, und da mit aller Einschränkung zu herbergen."
third_witness  = "Du kennst von Altersher meine Art, mich anzubauen, irgend an einem vertraulichen Ort ein Hüttchen aufzuschlagen, und da zu herbergen."

witnesses = []
witnesses.append( { "id": "Mü375", "content": first_witness  } )
witnesses.append( { "id": "V887", "content": second_witness } )
witnesses.append( { "id": "Oxford", "content": third_witness  } )
collation_input = { "witnesses": witnesses }

print(json.dumps( collation_input ))


{"witnesses": [{"id": "M\u00fc375", "content": "Du kennst von Alters her meine Art, mich anzubauen, irgend mir an einem vertraulichen Orte ein H\u00fcttchen aufzuschlagen, und da mit aller Einschr\u00e4nkung zu herbergen."}, {"id": "V887", "content": "Du kennst von Altersher meine Art, mich anzubauen, mir irgend an einem vertraulichen Ort ein H\u00fcttchen aufzuschlagen, und da mit aller Einschr\u00e4nkung zu herbergen."}, {"id": "Oxford", "content": "Du kennst von Altersher meine Art, mich anzubauen, irgend an einem vertraulichen Ort ein H\u00fcttchen aufzuschlagen, und da zu herbergen."}]}

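(A note on order: older versions of Python could print a dictionary's keys in a different order from the one in which we wrote them, which is what happened to "word" and "sequence" in the word list earlier. From Python 3.7 on, dictionaries keep the order in which keys were added, so here "id" still comes before "content". Either way, this is an important difference between dictionaries and lists - for dictionaries, it doesn't matter what order the information comes in!)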
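The \u00fc and \u00e4 sequences in the JSON output are escapes for ü and ä: by default json.dumps() writes every non-ASCII character this way. The sketch below, on a cut-down witness invented for the example, shows that the ensure_ascii=False option keeps the characters readable and that both forms load back into the same data.

```python
import json

witness = {"id": "Mü375", "content": "ein Hüttchen"}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(witness)
print(escaped)

# ensure_ascii=False leaves the characters as they are.
readable = json.dumps(witness, ensure_ascii=False)
print(readable)

# Both texts load back into exactly the same data.
print(json.loads(escaped) == json.loads(readable))
```

Either form is valid JSON, so which one to use is mostly a question of who has to read it.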

Now we can send that to the collator and get an answer back. It will assume that words are separated by spaces.

But before we do that, we need to tell the collation program a few extra things:


In [13]:
collation_input['tokenComparator'] = {'type':'equality'}
collation_input['algorithm'] = 'dekker'

Hang onto your hats - we are going to use the urllib module, which lets us send JSON to a program running on someone else's webserver, and get an answer back. Don't worry too much about the details at this point - just know that this is possible, and it is the sort of thing that JSON was invented for. We have our data structure (that is, our witnesses and their content) and we have serialized them into a JSON string so that we can send them away to the webserver, and we'll get a different JSON string back that we can deserialize into collation data.


In [14]:
from urllib import request

request_data = json.dumps(collation_input).encode('utf-8')
collation_headers = { 
    'Content-Type': 'application/json', 
    'Accept' : 'application/json' 
}
webrequest = request.Request( 'http://collatex.net/demo/collate', 
                             request_data, collation_headers )
webanswer = request.urlopen( webrequest )
print(webanswer.getcode())
json_answer = webanswer.read().decode('utf-8')
print(json_answer)


200
{"witnesses":["Mü375","Oxford","V887"],"table":[[["Du ","kennst ","von "],["Du ","kennst ","von "],["Du ","kennst ","von "]],[["Alters ","her "],["Altersher "],["Altersher "]],[["meine ","Art",", ","mich ","anzubauen",", "],["meine ","Art",", ","mich ","anzubauen",", "],["meine ","Art",", ","mich ","anzubauen",", "]],[[],[],["mir "]],[["irgend "],["irgend "],["irgend "]],[["mir "],[],[]],[["an ","einem ","vertraulichen "],["an ","einem ","vertraulichen "],["an ","einem ","vertraulichen "]],[["Orte "],["Ort "],["Ort "]],[["ein ","Hüttchen ","aufzuschlagen",", ","und ","da "],["ein ","Hüttchen ","aufzuschlagen",", ","und ","da "],["ein ","Hüttchen ","aufzuschlagen",", ","und ","da "]],[["mit ","aller ","Einschränkung "],[],["mit ","aller ","Einschränkung "]],[["zu ","herbergen","."],["zu ","herbergen","."],["zu ","herbergen","."]]]}

So what have we here? A dictionary object with two keys, 'witnesses' and 'table'. The table is a list of lists - each row in the table becomes a list reading left to right, and the table itself is the list of rows. In this case, each table cell is itself a list of words. So it's a slightly more complex table than usual...
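To make that shape concrete, here is a sketch that works on a shortened, hand-copied excerpt of the answer (three table rows only). It picks out a single cell by index and then lists the rows in which the witnesses do not all agree. The variable names and the "rows that differ" check are illustrations added here, not part of the collation service.

```python
import json

# A shortened excerpt of the answer, still as JSON text.
json_excerpt = '''{"witnesses": ["Mü375", "Oxford", "V887"],
 "table": [[["Du ", "kennst ", "von "], ["Du ", "kennst ", "von "], ["Du ", "kennst ", "von "]],
           [["Alters ", "her "], ["Altersher "], ["Altersher "]],
           [["Orte "], ["Ort "], ["Ort "]]]}'''
excerpt = json.loads(json_excerpt)

# One table row holds one cell per witness; each cell is a list of words.
second_row = excerpt['table'][1]
print(second_row[0])

# A row whose cells are not all equal marks a place where the witnesses differ.
variants = [row for row in excerpt['table']
            if any(cell != row[0] for cell in row)]
for row in variants:
    for witness, cell in zip(excerpt['witnesses'], row):
        print(witness, '->', ''.join(cell))
    print()
```

The zip() call pairs each witness ID with its cell, relying on the two lists being in the same order.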

In this case, the 'witnesses' key is really the first row of our collation table - a list of our witness IDs in the order they appear in the rest of the table rows. First let's turn the JSON into real data:


In [15]:
collation = json.loads( json_answer )

...and then let's print out the table as HTML. For this we will make use of IPython's ability to spit out HTML, and we will make it so that each witness takes up a row. You don't have to pay too much attention to how this is done if you don't know HTML, but it lets you see how we can format, slice, dice, and rearrange data.


In [16]:
from IPython.display import HTML

witness_rows = []

# Start the rows with the witness IDs as headers
for witness in collation['witnesses']:
    row_html = '<th>%s</th>' % witness
    witness_rows.append( row_html )

# Now for each set of matching words in the collation, add them to their
# respective witness rows
for row in collation['table']:
    for index, cell in enumerate( row ):
        cell_html = '<td>%s</td>' % ' '.join( cell )
        witness_rows[index] += cell_html
    
# Make the table with all the rows
collation_table = '<table>'
for html_row in witness_rows:
    collation_table += '<tr>%s</tr>' % html_row
collation_table += '</table>'

# Display it
print("This is the plain HTML that we get from our code above.")
print(collation_table)   # This is what the HTML looks like. Pointy, eh?

## Do NOT try this line in PyCharm! It only works in this IPython notebook.
print("\nAnd this is what that HTML turns into in a web browser.")
HTML(collation_table)    # This is what the browser does with all those pointy brackets


This is the plain HTML that we get from our code above.
<table><tr><th>Mü375</th><td>Du  kennst  von </td><td>Alters  her </td><td>meine  Art ,  mich  anzubauen , </td><td></td><td>irgend </td><td>mir </td><td>an  einem  vertraulichen </td><td>Orte </td><td>ein  Hüttchen  aufzuschlagen ,  und  da </td><td>mit  aller  Einschränkung </td><td>zu  herbergen .</td></tr><tr><th>Oxford</th><td>Du  kennst  von </td><td>Altersher </td><td>meine  Art ,  mich  anzubauen , </td><td></td><td>irgend </td><td></td><td>an  einem  vertraulichen </td><td>Ort </td><td>ein  Hüttchen  aufzuschlagen ,  und  da </td><td></td><td>zu  herbergen .</td></tr><tr><th>V887</th><td>Du  kennst  von </td><td>Altersher </td><td>meine  Art ,  mich  anzubauen , </td><td>mir </td><td>irgend </td><td></td><td>an  einem  vertraulichen </td><td>Ort </td><td>ein  Hüttchen  aufzuschlagen ,  und  da </td><td>mit  aller  Einschränkung </td><td>zu  herbergen .</td></tr></table>

And this is what that HTML turns into in a web browser.
Out[16]:
Mü375 | Du kennst von | Alters her | meine Art , mich anzubauen , | | irgend | mir | an einem vertraulichen | Orte | ein Hüttchen aufzuschlagen , und da | mit aller Einschränkung | zu herbergen .
Oxford | Du kennst von | Altersher | meine Art , mich anzubauen , | | irgend | | an einem vertraulichen | Ort | ein Hüttchen aufzuschlagen , und da | | zu herbergen .
V887 | Du kennst von | Altersher | meine Art , mich anzubauen , | mir | irgend | | an einem vertraulichen | Ort | ein Hüttchen aufzuschlagen , und da | mit aller Einschränkung | zu herbergen .
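Since the HTML display only works inside the notebook, here is a rough plain-text alternative that should run anywhere, PyCharm included. It is a sketch on a shortened, hand-copied excerpt of the collation answer; the " | " separator and the "-" for an empty cell are arbitrary choices made for this example.

```python
# A shortened excerpt of the collation data, already loaded into Python.
excerpt = {
    "witnesses": ["Mü375", "Oxford", "V887"],
    "table": [
        [["Du ", "kennst ", "von "], ["Du ", "kennst ", "von "], ["Du ", "kennst ", "von "]],
        [["Alters ", "her "], ["Altersher "], ["Altersher "]],
        [["mit ", "aller ", "Einschränkung "], [], ["mit ", "aller ", "Einschränkung "]],
    ],
}

text_rows = []
for index, witness in enumerate(excerpt["witnesses"]):
    # Pick this witness's cell out of every table row. The words already carry
    # their trailing spaces, so we join them directly and trim the end.
    cells = [''.join(row[index]).strip() or '-' for row in excerpt["table"]]
    text_rows.append(witness + ': ' + ' | '.join(cells))

for line in text_rows:
    print(line)
```

The columns will not line up neatly the way they do in a browser, but the gaps and differences between witnesses are still visible.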

Great! We took our data and made JSON out of it, and then we sent it over the Internet to a collation service called CollateX and received JSON back. And now we have a collation table that we constructed from the JSON that we got. Victory!
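If you want to repeat the sending step with other witnesses, it can be wrapped in a pair of small functions. This is only a sketch: the function names build_collation_request and collate, and the 30-second timeout, are invented here. The first function only prepares the request, so the example at the bottom runs without touching the network.

```python
import json
from urllib import request

def build_collation_request(collation_input, url):
    """Package a Python dictionary as a JSON web request (nothing is sent yet)."""
    request_data = json.dumps(collation_input).encode('utf-8')
    headers = {'Content-Type': 'application/json',
               'Accept': 'application/json'}
    return request.Request(url, request_data, headers)

def collate(collation_input, url, timeout=30):
    """Send the request and return the answer, deserialized into Python data."""
    webrequest = build_collation_request(collation_input, url)
    with request.urlopen(webrequest, timeout=timeout) as webanswer:
        return json.loads(webanswer.read().decode('utf-8'))

sample_input = {"witnesses": [
    {"id": "A", "content": "This is the first text"},
    {"id": "B", "content": "This is the second text"},
]}
prepared = build_collation_request(sample_input, 'http://collatex.net/demo/collate')
print(prepared.get_method())   # a request that carries data is sent as a POST
print(prepared.data)           # the serialized JSON, encoded as bytes
```

Calling collate(sample_input, 'http://collatex.net/demo/collate') would then send the request for real, provided the service is reachable.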